 training run






A Appendix

Neural Information Processing Systems

Perplexity vs. FLOP count of MIM compared to left-to-right baselines across model sizes. To evaluate the effectiveness of "Meet in the Middle" (MIM) pre-training compared to left-to-right ... Perplexity vs. training time of MIM compared to left-to-right baselines across model sizes. Our largest models of size 2.7B parameters are trained using 128 A100 GPUs with 80GB ... See Table 10 for the details of all the training runs. This paper presents "Meet in the Middle", a novel pretraining paradigm for language models that ... The proposed method's secondary benefits in the infilling task could also improve several NLP tasks, such as text summarization and question answering, leading to better usability and overall ...



AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Neural Information Processing Systems

Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire dataset for different possible sets of hyper-parameters. Our central insight is that using an informative subset of the dataset for the model training runs involved in hyper-parameter optimization allows us to find the optimal hyper-parameter configuration significantly faster. In this work, we propose AUTOMATA, a gradient-based subset selection framework for hyper-parameter tuning. We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains. Our experiments show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3×-30× while achieving comparable performance to the hyper-parameters found using the entire dataset.
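The idea of a gradient-based data subset can be illustrated with a toy sketch. The code below is not AUTOMATA's algorithm, which the abstract does not spell out; it is a simplified greedy stand-in that picks examples whose mean gradient stays close to the full-dataset mean gradient. The linear model, squared-error loss, synthetic data, and all function names are assumptions made for illustration.

```python
import random


def per_example_gradients(xs, ys, w):
    """Gradient of the squared error (w.x - y)^2 with respect to w, per example."""
    grads = []
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([2.0 * residual * xi for xi in x])
    return grads


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def greedy_gradient_subset(grads, k):
    """Greedily pick k indices whose mean gradient tracks the full mean gradient.

    Hypothetical helper: at each step it adds the example that brings the
    subset's mean gradient closest to the full-dataset mean gradient.
    """
    k = min(k, len(grads))
    target = mean_vector(grads)
    chosen = []
    running_sum = [0.0] * len(target)
    remaining = set(range(len(grads)))
    for _ in range(k):
        size = len(chosen) + 1
        best_idx, best_err = None, float("inf")
        for j in sorted(remaining):  # sorted for deterministic tie-breaking
            candidate = [(s + g) / size for s, g in zip(running_sum, grads[j])]
            err = distance(candidate, target)
            if err < best_err:
                best_idx, best_err = j, err
        chosen.append(best_idx)
        remaining.remove(best_idx)
        running_sum = [s + g for s, g in zip(running_sum, grads[best_idx])]
    return chosen


def subset_error(grads, indices):
    """Distance between the subset's mean gradient and the full mean gradient."""
    return distance(mean_vector([grads[i] for i in indices]), mean_vector(grads))


# Synthetic regression data (an assumption; not from the paper).
random.seed(0)
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
ys = [sum(x) + random.gauss(0, 0.1) for x in xs]
w = [0.0, 0.0, 0.0]  # gradients are evaluated at an untrained model

grads = per_example_gradients(xs, ys, w)
subset = greedy_gradient_subset(grads, 10)
print("selected indices:", subset)
print("gradient mismatch:", subset_error(grads, subset))
```

In a tuning loop, each candidate hyper-parameter configuration would then be trained on the selected subset rather than the full dataset, which is where the compute savings described above would come from.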


Analog Physical Systems Can Exhibit Double Descent

Dillavou, Sam, Rocks, Jason W, Wycoff, Jacob F, Liu, Andrea J, Durian, Douglas J

arXiv.org Artificial Intelligence

An important component of the success of large AI models is double descent, in which networks avoid overfitting as they grow relative to the amount of training data, instead improving their performance on unseen data. Here we demonstrate double descent in a decentralized analog network of self-adjusting resistive elements. This system trains itself and performs tasks without a digital processor, offering potential gains in energy efficiency and speed -- but must endure component non-idealities. We find that standard training fails to yield double descent, but a modified protocol that accommodates this inherent imperfection succeeds. Our findings show that analog physical systems, if appropriately trained, can exhibit behaviors underlying the success of digital AI. Further, they suggest that biological systems might similarly benefit from over-parameterization.